Take-home Exercise 2

0. Goals & Objectives.

Our goal is to conduct a case study demonstrating the potential value of Geospatial Data Science & Analysis: by integrating publicly available data from multiple sources, we build spatial interaction models to determine the factors affecting the urban mobility patterns of public bus transit.

Recent deployments of massively pervasive computing technologies, such as in-vehicle GPS and SMART cards in public transport, provide plenty of data for tracking movement patterns through space and time. However, the explosive growth of data has outstripped public services’ ability to utilise, transform, and understand it.

More detail about the task is available at: https://isss624-ay2023-24nov.netlify.app/take-home_ex02

Modifiable Areal Unit Problem (MAUP): MPSZ planning subzones are too coarse; a subzone can cover a huge area, and people may live in only one corner of it. Because planning subzones are too large, we use an analytical hexagon grid instead.

“A map by itself is not interesting; the pattern revealed by the map, and the factors affecting it, are interesting. Can we explain this by building a spatial model?” We therefore build a model to explain flows: a Spatial Interaction Model.

GLM: Generalised Linear Model (as opposed to a plain linear regression model):

  • Dwelling units as a proxy for population;
  • We can compare HDB-only vs all dwelling units vs room-flat counts (e.g. 1-, 2-, and 3-room flat units);
  • POI: points of interest, by name and type.
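To make the modelling goal concrete, here is a minimal, hedged sketch of an unconstrained spatial interaction model fitted with glm(): a Poisson regression of trips on log-transformed origin, destination, and distance variables. All data and column names below (orig_pop, dest_units, dist, trips) are simulated for illustration only and are not the study’s actual variables:

```r
# Sketch of a Poisson spatial interaction model on a toy OD table.
# Column names are illustrative assumptions, not the study's real data.
set.seed(42)
n <- 100
od <- data.frame(
  orig_pop   = runif(n, 100, 5000),  # e.g. dwelling units as population proxy
  dest_units = runif(n, 100, 5000),  # attractiveness of the destination
  dist       = runif(n, 200, 8000)   # metres between origin and destination
)

# Simulate trip counts from a known power-law relationship
mu <- exp(-2 + 0.8 * log(od$orig_pop) + 0.6 * log(od$dest_units) - 0.9 * log(od$dist))
od$trips <- rpois(n, mu)

# Poisson GLM with log link: the classic log-log spatial interaction form
sim <- glm(trips ~ log(orig_pop) + log(dest_units) + log(dist),
           family = poisson(link = "log"), data = od)
coef(sim)
```

With real data, orig_pop could be the dwelling-unit proxy discussed above and dist the centroid-to-centroid distance between hexagons; the log-log Poisson form is one common specification, not the only choice.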
Important Note

Due to the nature of EDA and data analysis, parts of this page have been collapsed or placed behind tabs to avoid excessive scrolling.

For easier reading, this study is also presented in point-form.

0.1 Dataset used

0.1.1 Aspatial Dataset

Dataset, Purpose & Source: Key columns

hdb.csv : HDB property information data with geocoding (via data.gov.sg)

  • lat, long: coordinates
  • max_floors, total_dwelling_units: an idea of how many units each HDB block holds

origin_destination_bus_202310.csv : Volume of bus passenger trips, by origin/destination, prepared for October 2023 (via LTA DataMall)

  • ORIGIN_PT_CODE, DESTINATION_PT_CODE: origin/destination information for each bus trip
  • TIME_PER_HOUR, DAY_TYPE: time/date data on when the trips were taken
  • TOTAL_TRIPS: bus trip passenger volume

0.1.2 Geospatial Dataset

Filename, Purpose & Source: Key columns

BusStop (ESRI shapefile) : Bus stop locations (via LTA DataMall)

  • BUS_STOP_N: bus stop number
  • LOC_DESC: location description

RapidTransitSystemStation (ESRI shapefile) : MRT/LRT station polygons (via LTA DataMall)

  • STN_NAM_DE: station name
  • TYP_CD_DES: station type (MRT/LRT)

MPSZ-2019 (ESRI shapefile) : URA Master Plan 2019 planning subzone boundaries (via data.gov.sg)

1. Geospatial Data Wrangling

This study was performed in R, written in R Studio, and published using Quarto.

1.1 Import Packages

This code chunk calls pacman::p_load() to load the tmap, sf, sfdep, tidyverse, DT, and leaflet packages;

  • tmap : For thematic mapping; powerful mapping package;
  • sf : for geospatial data handling, but also geoprocessing: buffer, point-in-polygon count, etc;
  • sfdep : useful functions for creating weight matrix, LISA calculations etc;
  • tidyverse : for non-spatial data handling; commonly used R package and contains dplyr for dataframe manipulation and ggplot for data visualization;
  • DT: for displaying datatables;
  • leaflet: for custom layer controls over tmap visualisations.
pacman::p_load(tmap, sf, sfdep, tidyverse, DT, leaflet)

1.2 Import Data

1.2.1 Load Geospatial Bus Stop Data

  • First, we load the BusStop shapefile data from LTA DataMall;
  • st_read() is used to import the ESRI Shapefile data into an sf dataframe.
    • From the previous take-home exercise, we know BusStop carries an incorrect CRS (coordinate reference system), EPSG 9001 instead of 3414;
    • We therefore pipe the result into st_set_crs(3414) to assign the correct CRS.
bus_stop_sf <- st_read(dsn = "data/geospatial", 
                 layer = "BusStop") %>%
  st_set_crs(3414)
Reading layer `BusStop' from data source 
  `C:\1darren\ISSS624\Take-home_Ex02\data\geospatial' using driver `ESRI Shapefile'
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
that
head(bus_stop_sf)
Simple feature collection with 6 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 13228.59 ymin: 30391.85 xmax: 41603.76 ymax: 44206.38
Projected CRS: SVY21 / Singapore TM
  BUS_STOP_N BUS_ROOF_N             LOC_DESC                  geometry
1      22069        B06   OPP CEVA LOGISTICS POINT (13576.31 32883.65)
2      32071        B23         AFT TRACK 13 POINT (13228.59 44206.38)
3      44331        B01              BLK 239  POINT (21045.1 40242.08)
4      96081        B05 GRACE INDEPENDENT CH POINT (41603.76 35413.11)
5      11561        B05              BLK 166 POINT (24568.74 30391.85)
6      66191        B03         AFT CORFE PL POINT (30951.58 38079.61)
mrt_sf <- st_read(dsn = "data/geospatial", 
                 layer = "RapidTransitSystemStation") %>%
  st_set_crs(3414)
Reading layer `RapidTransitSystemStation' from data source 
  `C:\1darren\ISSS624\Take-home_Ex02\data\geospatial' using driver `ESRI Shapefile'
Warning in CPL_read_ogr(dsn, layer, query, as.character(options), quiet, : GDAL
Message 1: Non closed ring detected. To avoid accepting it, set the
OGR_GEOMETRY_ACCEPT_UNCLOSED_RING configuration option to NO
Simple feature collection with 220 features and 4 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 6068.209 ymin: 27478.44 xmax: 45377.5 ymax: 47913.58
Projected CRS: SVY21
Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
that
st_is_valid(mrt_sf)
  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [61]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [85]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [97]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[109]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[121]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[133]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[145]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[157]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE    NA  TRUE  TRUE  TRUE  TRUE  TRUE
[169]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[181]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[193]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[205]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[217]  TRUE FALSE  TRUE  TRUE
st_crs(mrt_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
head(mrt_sf)
Simple feature collection with 6 features and 4 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 29286.74 ymin: 30548.91 xmax: 34623.54 ymax: 40980.53
Projected CRS: SVY21 / Singapore TM
  TYP_CD STN_NAM TYP_CD_DES              STN_NAM_DE
1      0    <NA>        MRT   ESPLANADE MRT STATION
2      0    <NA>        MRT  PAYA LEBAR MRT STATION
3      0    <NA>        MRT DHOBY GHAUT MRT STATION
4      0    <NA>        MRT      DAKOTA MRT STATION
5      0    <NA>        MRT    LAVENDER MRT STATION
6      0    <NA>        LRT     RENJONG LRT STATION
                        geometry
1 POLYGON ((30566.07 30621.21...
2 POLYGON ((34495.6 33384.44,...
3 POLYGON ((29293.51 31312.53...
4 POLYGON ((34055.08 32290.62...
5 POLYGON ((31236.5 32085.76,...
6 POLYGON ((34382.66 40949.64...
  • We check the coordinate reference system with st_crs(), and see that it is now indeed correctly set to EPSG 3414:
st_crs(bus_stop_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
  • Recall that although the raw shapefile’s projection is SVY21, it was imported with CRS EPSG 9001, which is incorrect; the output above confirms it has now been correctly assigned EPSG 3414.
tmap_mode("plot")
tmap mode set to plotting
qtm(bus_stop_sf)

1.2.2 Visualizing Within Singapore’s MPSZ Boundaries

  • This image is hard to read; we can vaguely discern Singapore’s southern coastline, but the plot is hard to interpret without more context.
  • I have sourced and downloaded supplemental data about Singapore’s Administrative National Boundary (“SANB”) from igismap.com as a base layer for visualization;
    • We set tmap_mode("view") to allow us to scroll and confirm the SANB boundaries;
    • As before, we use st_read() to load the shapefile data, and st_transform() to ensure the projection is correct;
      • We use tm_shape() + tm_polygons() to map a grey layer of the SANB boundaries;
      • On top of this, we layer tm_shape() + tm_dots() to show the bus stops.
mpsz_sf <- st_read(dsn = "data/geospatial", 
                 layer = "MPSZ-2019") %>%
  st_transform(crs=3414)
Reading layer `MPSZ-2019' from data source 
  `C:\1darren\ISSS624\Take-home_Ex02\data\geospatial' using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS:  WGS 84
tmap_mode("view")
tmap mode set to interactive viewing
tmap_options(check.and.fix = TRUE)
tm_shape(mpsz_sf) +
  tm_polygons() + 
  tm_shape(bus_stop_sf) + 
  tm_dots()
Warning: The shape mpsz_sf is invalid (after reprojection). See sf::st_is_valid
## Additional data from: Data.gov.sg, https://beta.data.gov.sg/datasets/d_02cba6aeeed323b5f6c723527757c0bc/view
  • We note there are a number of bus stops outside Singapore’s boundaries; Specifically, three bus stops in a cluster just outside the Causeway, and one further North.
  • We perform several steps to isolate & check the data;
    • we use st_filter() to find bus stops within Singapore’s Administrative National Boundaries, and create sg_bus_stop_sf for future use.
    • to check which bus stops have been dropped, we use anti_join() to compare bus_stop_sf with sg_bus_stop_sf. We use st_drop_geometry() on both sf dataframes to compare only the non-geometry columns.
sg_bus_stop_sf <- st_filter(bus_stop_sf, mpsz_sf)
anti_join(st_drop_geometry(bus_stop_sf), st_drop_geometry(sg_bus_stop_sf))
Joining with `by = join_by(BUS_STOP_N, BUS_ROOF_N, LOC_DESC)`
  BUS_STOP_N BUS_ROOF_N            LOC_DESC
1      47701        NIL          JB SENTRAL
2      46239         NA          LARKIN TER
3      46609         NA     KOTARAYA II TER
4      46211        NIL JOHOR BAHRU CHECKPT
5      46219        NIL JOHOR BAHRU CHECKPT
  • We see there are in fact 5 bus stops outside of Singapore (including the defunct Kotaraya II Terminal) that have been removed, confirming that our data cleaning was correct.

1.3 Geospatial Data Cleaning

1.3.1 Removing Duplicate Bus Stops

  • But, do we need to do more data cleaning?
  • We use length() to find the total number of raw values in the $BUS_STOP_N column of sg_bus_stop_sf;
    • We then compare this to length(unique()) to find the unique values;
  • And, indeed, we find there are 16 bus stops that are repeated;
cat("Total number of rows in sg_bus_stop_sf\t\t: ", paste0(length(sg_bus_stop_sf$BUS_STOP_N)))
cat("\nTotal unique bus stop IDs in sg_bus_stop_sf\t: ", paste0(length(unique(sg_bus_stop_sf$BUS_STOP_N))))
cat("\nRepeated bus stops\t\t\t\t:   ", paste0(length(sg_bus_stop_sf$BUS_STOP_N) - length(unique(sg_bus_stop_sf$BUS_STOP_N))))
  • It appears there are 16 datapoints that are specifically repeated; let’s identify the bus stop numbers with repeated rows:
    • First, we use filter() with the | (OR) operator to identify repeated bus stop numbers (i.e. $BUS_STOP_N), catching both first and later occurrences with duplicated() and duplicated(fromLast = TRUE);
    • We sort them in ascending order using arrange(); then, using group_by() and row_number(), we add row numbers based on $BUS_STOP_N to a new column using mutate().
repeated_df <- sg_bus_stop_sf %>%
  filter(duplicated(BUS_STOP_N) | duplicated(BUS_STOP_N, fromLast = TRUE)) %>% 
  arrange(BUS_STOP_N) %>%
  group_by(BUS_STOP_N) %>%
  mutate(RowNumber = row_number())

datatable(repeated_df)
  • From examination, there are 32 bus stops sharing 16 bus stop numbers – 16 pairs of bus stops sharing the same number.
  • Let’s examine these bus stop pairs on the map;
    • we use tmap in view mode to display these repeated bus stops on the map;
    • we use col = "BUS_STOP_N" with palette = "Spectral" to give each pair of bus stops an individual colour.
tmap_mode("view")
tm_shape(sanb_sf) +
  tm_polygons() + 
  tm_shape(repeated_df) + 
  tm_dots(col = "BUS_STOP_N", palette = "Spectral")
  • After confirming with Prof Kam, we will simply drop the second instance of the rows.
    • we use duplicated() to identify rows with repeated values of $BUS_STOP_N; duplicated rows will return TRUE while all other rows will return FALSE
    • We use ! to invert the values, so only the unduplicated rows will return TRUE.
    • We then use square brackets [] to index sg_bus_stop_sf based on the rows, and return only the unduplicated rows;
sg_bus_stop_sf <- sg_bus_stop_sf[!duplicated(sg_bus_stop_sf$BUS_STOP_N), ]
head(sg_bus_stop_sf)

I was unable to take Prof Kam’s Data Analytics Lab, but I know of his fervour and attention to detail. I believe in informed choices, and so I performed manual cleaning with the following steps:

  • Remove the rows with lowercase names, as most $LOC_DESC are in strict uppercase

  • Remove the rows with NA $LOC_DESC

  • Remove the row with NIL $LOC_DESC

  • For remaining rows, drop second entry

  • Retain remaining rows

After this, we just run the same steps on sg_bus_stop_sf or perform an anti-join.

However, after clarification with Prof Kam, we just drop the second entry. My code is shown below for the gratification of those who may enjoy it.

drop_second_stop = c("43709", "51071", "53041", "52059", "58031", "68091", "68099", "97079")
rows_to_retain_df <- repeated_df %>%
  filter(
    case_when(
      BUS_STOP_N == "11009" & grepl("[a-z]", LOC_DESC) ~ FALSE,
      BUS_STOP_N == "22501" & grepl("[a-z]", LOC_DESC) ~ FALSE,
      BUS_STOP_N == "77329" & grepl("[a-z]", LOC_DESC) ~ FALSE,
      BUS_STOP_N == "82221" & grepl("[a-z]", LOC_DESC) ~ FALSE,
      BUS_STOP_N == "62251" & grepl("[a-z]", LOC_DESC) ~ FALSE,
      BUS_STOP_N == "96319" & grepl("[a-z]", LOC_DESC) ~ FALSE,

      BUS_STOP_N == "47201" & is.na(LOC_DESC) ~ FALSE,

      BUS_STOP_N == "67421" & BUS_ROOF_N == "NIL" ~ FALSE,
      BUS_STOP_N %in% drop_second_stop & RowNumber == 2 ~ FALSE,

      TRUE ~ TRUE
    )
  )

rows_to_retain_df$LOC_DESC  = toupper(rows_to_retain_df$LOC_DESC)

print("Printing rows to retain:")
rows_to_retain_df
  • If we run the steps from above, we can see that there are no repeated bus stops.
cat("Total number of rows in sg_bus_stop_sf\t\t: ", paste0(length(sg_bus_stop_sf$BUS_STOP_N)))
cat("\nTotal unique bus stop IDs in sg_bus_stop_sf\t: ", paste0(length(unique(sg_bus_stop_sf$BUS_STOP_N))))
cat("\nRepeated bus stops\t\t\t\t:   ", paste0(length(sg_bus_stop_sf$BUS_STOP_N) - length(unique(sg_bus_stop_sf$BUS_STOP_N))))

1.4 Generating Hexagon Maps

  • We use st_make_grid() with square = FALSE to create the hexagon layer as defined in the study, which we name raw_hex_grid;
    • We pass cellsize = 500 to create hexagons of the required size. Prof Kam defined the apothem as 250 m, and for hexagons the cellsize is the distance between opposite edges, which is double the apothem, or 500 m;
      • I used units::as_units() to pass 500 metres into the argument. I am still uncertain whether a length of 500 m needs to be reprojected, or whether any further transformation is needed.
  • We use st_sf() to convert raw_hex_grid into an sf dataframe;
    • mutate() is used here to create a grid_id column;
    • We also apply st_transform(), in case the coordinate system needs reprojecting.
  • However, visualizing this right now just gives us a map full of hexagons:
raw_hex_grid = st_make_grid(sg_bus_stop_sf, cellsize = units::as_units(500, "m"), what = "polygons", square = FALSE) %>%
  st_transform(crs = 3414)
# To sf and add grid ID
raw_hex_grid <- st_sf(raw_hex_grid) %>%
  # add grid ID
  mutate(grid_id = 1:length(lengths(raw_hex_grid))) %>%
  st_transform(crs = 3414)

tmap_mode("plot")
qtm(raw_hex_grid)
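The 500 m cellsize can be sanity-checked with a little base-R hexagon geometry, assuming (per the sf documentation) that for hexagonal cells cellsize is the distance between opposite edges, i.e. twice the apothem:

```r
# Dimensions of a regular hexagon with apothem a (centre to edge midpoint)
apothem       <- 250                      # metres, as defined in the study
cellsize      <- 2 * apothem              # flat-to-flat distance -> 500 m
side          <- 2 * apothem / sqrt(3)    # edge length, ~288.7 m
long_diagonal <- 2 * side                 # vertex-to-vertex, ~577.4 m
area_m2       <- 2 * sqrt(3) * apothem^2  # ~216,506 m^2, i.e. ~0.217 km^2

c(cellsize = cellsize, side = round(side, 1), area_km2 = round(area_m2 / 1e6, 3))
```

So each analytical hexagon covers roughly 0.22 km², a much finer unit than most planning subzones.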
  • What we will need is to isolate only the hexagons with bus stops in them.
    • We use lengths(st_intersects()) to count the number of bus stops in each hexagon, and add that to a new column, $n_bus_stops ;
    • We then create hexagon_sf by filtering for only hexagons with non-zero bus stops in each.
  • We then plot these using the usual tmap functions;
    • tm_basemap() is used to create a “basemap” layer underneath to orient our hexagons.
    • We use OneMapSG as our comprehensive map of Singapore; if you zoom in, you can actually count the bus stops, shown in red, on the map.
# Count number of points in each grid, code snippet referenced from: 
# https://gis.stackexchange.com/questions/323698/counting-points-in-polygons-with-sf-package-of-r

raw_hex_grid$n_bus_stops = lengths(st_intersects(raw_hex_grid, sg_bus_stop_sf))

# remove grid without value of 0 (i.e. no points inside that grid)
hexagon_sf = filter(raw_hex_grid, n_bus_stops > 0)
# head(hexagon_sf)

tmap_mode("view")
tm_basemap(providers$OneMapSG.Grey) + 
  tm_shape(hexagon_sf) +
  tm_fill(
    col = "n_bus_stops",
    palette = "-plasma",
    style = "cont",
    
    breaks = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
    title = "Number of bus_stops",
    id = "grid_id",
    showNA = FALSE,
    alpha = 0.6,
    popup.vars = c(
      "Number of Bus Stops: " = "n_bus_stops"
    ),
    popup.format = list(
      n_bus_stops = list(format = "f", digits = 0)
    )
  ) +
  tm_borders(col = "grey40", lwd = 0.7)
  • There seem to be 2 regions of deep purple, centred over an area near, but not exactly over Pasir Ris and Choa Chu Kang MRTs.
  • We perform some simple stats to count the total number of filtered hexagons, and to see the maximum number of bus stops in a hexagon.
cat(paste("Total number of raw hexagons is\t\t\t: ", nrow(raw_hex_grid), "\n"))
cat(paste("Total number of hexagons (n_bus_stops > 0) is\t: ", nrow(hexagon_sf)), "\n")

cat("\nPrinting hexagons with more than 10 bus stops:\n >> ")
hexagon_sf[hexagon_sf$n_bus_stops > 10, ]
  • For the next step, we’ll need to manage the aspatial bus trips dataset, which is what we’ll work on now.

1.5 Aspatial Data Wrangling: Bus trip dataset

1.5.1 Import Bus O/D Dataset

  • For our purposes, we will focus only on 2023-October Passenger Volume by Origin Destination Bus Stops, downloaded from LTA DataMall;

  • We use read_csv() to load the data from the .csv file;

  • We use select() with a - sign to remove two columns redundant for our analysis:

    • $PT_TYPE column indicates the type of public transport, however, every value is “BUS”
    • $YEAR_MONTH column similarly has “2023-10” for every value, which we are aware of
    • With this in mind, we drop these two columns to save space.
  • Finally, we use as.factor() to convert two columns ($ORIGIN_PT_CODE and $DESTINATION_PT_CODE) from character to factor, for easier analysis.

odbus_2310 <- read_csv("data/aspatial/origin_destination_bus_202310.csv")
odbus_2310 <- select(odbus_2310, -PT_TYPE, -YEAR_MONTH)
odbus_2310$ORIGIN_PT_CODE <- as.factor(odbus_2310$ORIGIN_PT_CODE)
odbus_2310$DESTINATION_PT_CODE <- as.factor(odbus_2310$DESTINATION_PT_CODE)

str(odbus_2310)
  • This is a huge tibble dataframe with over 5 million rows.

1.5.2 Filter For Peaks

  • We now perform a multi-step filter process;
    1. We combine mutate() with case_when() to create a new column, $PEAK, based on the study criteria:
      1. “WEEKDAY_MORNING_PEAK” if $DAY_TYPE is “WEEKDAY” and bus tap-on time (i.e. $TIME_PER_HOUR) is between 6 am and 9 am, inclusive;
      2. “WEEKDAY_AFTERNOON_PEAK” if $DAY_TYPE is “WEEKDAY” and bus tap-on time is between 5 pm and 8 pm, inclusive;
      3. “WEEKEND_MORNING_PEAK” if $DAY_TYPE is “WEEKENDS/HOLIDAY” and bus tap-on time is between 11 am and 2 pm, inclusive;
      4. “WEEKEND_EVENING_PEAK” if $DAY_TYPE is “WEEKENDS/HOLIDAY” and bus tap-on time is between 4 pm and 7 pm, inclusive;
      5. Remaining values are assigned “Unknown” value.
    2. We then use filter() to eliminate those with “Unknown” $PEAK values, i.e. rows outside the period of interest for the study
    3. We use group_by() to group the values by $ORIGIN_PT_CODE and $PEAK, and use summarise() to aggregate the sum of $TOTAL_TRIPS as a new column, $TRIPS.
    4. Finally, we use pivot_wider() to pivot our long table into wide format, where each column represents one of the peak periods of interest in our study, and the values are filled from $TRIPS. We use the values_fill argument to fill in a value of “0” if the bus stop has a missing value.
    5. We use write_rds() to save the output dataframe, odbus_filtered, as an RDS object for future reference/load.
Note
  • Note that we convert the values for $TIME_PER_HOUR to the 24-hour clock, e.g. “5 pm” is 17 hundred hours and “8 pm” is 20 hundred hours.
  • We truncate “WEEKENDS/HOLIDAY” to “WEEKEND” for easier reading/reference.
odbus_filtered <- odbus_2310 %>%
  mutate(PEAK = case_when(
    DAY_TYPE == "WEEKDAY" & TIME_PER_HOUR >= 6 &  TIME_PER_HOUR <= 9 ~ "WEEKDAY_MORNING_PEAK",
    DAY_TYPE == "WEEKDAY" & TIME_PER_HOUR >= 17 &  TIME_PER_HOUR <= 20 ~ "WEEKDAY_AFTERNOON_PEAK",
    DAY_TYPE == "WEEKENDS/HOLIDAY" & TIME_PER_HOUR >= 11 &  TIME_PER_HOUR <= 14 ~ "WEEKEND_MORNING_PEAK",
    DAY_TYPE == "WEEKENDS/HOLIDAY" & TIME_PER_HOUR >= 16 &  TIME_PER_HOUR <= 19 ~ "WEEKEND_EVENING_PEAK",
    TRUE ~ "Unknown"
  )) %>%
  filter(
    case_when(
      PEAK == "Unknown" ~ FALSE,
      TRUE ~ TRUE
    )) %>%
  group_by(ORIGIN_PT_CODE, PEAK) %>%
  summarise(TRIPS = sum(TOTAL_TRIPS)) %>%
  pivot_wider(names_from = PEAK, values_from = TRIPS, values_fill = 0)


write_rds(odbus_filtered, "data/rds/odbus_filtered.rds")
head(odbus_filtered)

1.6 Combine Bus Trip Data With hexagon_sf Dataframe

  • For our study purposes, we need to have the number of bus trips originating from each hexagon. In order to achieve this, we must combine our three current dataframes:
    • hexagon_sf, an sf dataframe with $grid_id column (primary key) and the specific polygon geometry of each hexagon;
    • sg_bus_stop_sf, an sf dataframe with the $BUS_STOP_N (primary key) and the point geometry of each bus stop;
    • odbus_filtered, a tibble dataframe with the $ORIGIN_PT_CODE (primary key) column and the trip details for each of the four peak periods of interest.

1.6.1 Identify Hexagon grid_id For Each Bus Stop

  • First, we combine hexagon_sf with sg_bus_stop_sf ;
    • We use st_intersection to combine the dataframes such that each row of sg_bus_stop_sf has an associated grid_id;
    • We use select() to filter the resultant bus_stop_hexgrid_id dataframe to only $grid_id and $BUS_STOP_N columns, and use st_drop_geometry() to convert into a simple dataframe with just two columns:
bus_stop_hexgrid_id <- st_intersection(sg_bus_stop_sf, hexagon_sf) %>%
  select(grid_id, BUS_STOP_N) %>%
  st_drop_geometry()

cat("\nNumber of bus stops\t:", length(unique(bus_stop_hexgrid_id$BUS_STOP_N)))
cat("\nNumber of hexgrids\t:", length(unique(bus_stop_hexgrid_id$grid_id)))

head(bus_stop_hexgrid_id)

1.6.2 Identify Bus Trip Details For Each Hexagon grid_id

  • Here, we again use multiple steps to generate bus trip details for each hexagon grid_id;
    1. We use left_join() to add the grid_id to each row of odbus_filtered, since each row has a unique single bus stop ID (i.e. $BUS_STOP_N);
    2. We use select() to retain only the grid_id and the four peak-trips columns;
    3. We combine group_by() and summarise() to aggregate the trips for each peak by grid_id.
  • Finally, we use head() to preview the grid_trips_df tibble dataframe.
colnames(odbus_filtered)

grid_trips_df <- left_join(odbus_filtered, bus_stop_hexgrid_id, 
            by = c("ORIGIN_PT_CODE" = "BUS_STOP_N")) %>%
  select(grid_id, 
         WEEKDAY_MORNING_PEAK,
         WEEKDAY_AFTERNOON_PEAK,
         WEEKEND_MORNING_PEAK,
         WEEKEND_EVENING_PEAK)  %>%
  group_by(grid_id) %>%
  summarise(
    WEEKDAY_MORNING_TRIPS = sum(WEEKDAY_MORNING_PEAK), 
    WEEKDAY_AFTERNOON_TRIPS = sum(WEEKDAY_AFTERNOON_PEAK), 
    WEEKEND_MORNING_TRIPS = sum(WEEKEND_MORNING_PEAK), 
    WEEKEND_EVENING_TRIPS = sum(WEEKEND_EVENING_PEAK)
    )
head(grid_trips_df)

1.6.3 Combine Bus Trip Details Back Into hexagon_sf

  • Finally, it’s time to recombine bus trip data back into hexagon_sf;
    • We use left_join on $grid_id to add trip data back into hexagon_sf;
    • We add a failsafe mutate() to replace any “NA” values for the columns.
hexagon_sf <- left_join(hexagon_sf, grid_trips_df, 
            by = 'grid_id' ) %>%
  mutate(
    WEEKDAY_MORNING_TRIPS = ifelse(is.na(WEEKDAY_MORNING_TRIPS), 0, WEEKDAY_MORNING_TRIPS),
    WEEKDAY_AFTERNOON_TRIPS = ifelse(is.na(WEEKDAY_AFTERNOON_TRIPS), 0, WEEKDAY_AFTERNOON_TRIPS),
    WEEKEND_MORNING_TRIPS = ifelse(is.na(WEEKEND_MORNING_TRIPS), 0, WEEKEND_MORNING_TRIPS),
    WEEKEND_EVENING_TRIPS = ifelse(is.na(WEEKEND_EVENING_TRIPS), 0, WEEKEND_EVENING_TRIPS),
         )

head(hexagon_sf)

1.7 Exploratory Data Analysis Of Bus Trips, Across Peak Periods, By Hexagons

  • For ggplot, we need data in long format, so we can use gather() on grid_trips_df from Section 1.6.2 to pivot this;
  • We then pipe this into a geom_boxplot() for an exploratory look:
gather(grid_trips_df, key = "Peak", value = "Trips", -grid_id) %>%
  ggplot( aes(x=Peak, y=Trips, fill=Peak)) +
  geom_boxplot() + 
  ggtitle("Boxplot: Trips over peak periods, 2023-Oct data") +
    xlab("") + 
    theme(
      legend.position="none"
      ) +
  coord_flip()
Tip

We also observe that number of trips for Weekday Morning & Weekday Afternoon seems to be larger than Weekend Morning and Weekend Evening peak trips. This is also confirmed by the figure in the next section.

This means that we will have to consider Weekday and Weekend peaks on different scales.

  • This is an exceptionally ugly plot, but it shows an important point: there is some serious right skew in our dataset;
  • Clearly, some hexagons have exceptionally high trip counts compared to the rest. But could this be because some hexagons have up to 11 bus stops, while others have only 1 or 2?
  • We do a quick scatter plot based on $n_bus_stops to verify this:
hexagon_sf %>% 
  st_drop_geometry() %>% 
  pivot_longer(cols = starts_with("WEEK"),
               names_to = "PEAK", values_to = "TRIPS") %>%
  ggplot( aes(x=TRIPS, y=n_bus_stops, color=PEAK, shape=PEAK)) +
  geom_point(size=2) +
  ggtitle("Scatterplot: Trips over peak periods by number of bus stops per hexagon, 2023-Oct data") +
    theme(legend.position=c(.85, .15),
          legend.background = element_rect(fill = "transparent"),
          legend.key.size = unit(0.5, "cm"),  
          legend.text = element_text(size = 6),
          legend.title = element_text(size = 8)
    ) 
  • Surprising results from our plot! If we consider those with > 100,000 trips as outliers, most of them come from hexagons with between 4 and 8 bus stops;
  • There is some correlation between number of bus stops and high numbers of trips, but a stronger factor is peak time; Weekday Morning peak trips, followed by Weekday Afternoon peak trips, contribute to the largest outliers.
Tip

I note that these visualizations only scrape the surface of understanding the data. However, this is not the focus of our study; we do these quick visualizations only to provide better context for our study.